with a new retrieval approach based on language models that may also yield retrieval
improvements for PTO data.


Important  Findings  and  Conclusions 
None. 


Significant  Hardware  Development 


None 

Special  Comments 
None. 

Implication  for  Further  Research 

We will continue to improve the retrieval performance of the demonstration system
through better use of phrases.


Task 2: Browsing and Classification Techniques for Document Collections

Task  Objectives 

Technical Problems

Technical Results


In  the  classification  work  of  the  last  3  months,  we  have  continued  to  do  classification 
experiments  based  on  the  patent  class  hierarchy.  This  work  is  described  in  a  new  paper. 

In  the  summarization/visualization  area,  we  have  implemented  a  prototype  of  a  system 
that  summarizes  using  a  concept  hierarchy.  Experiments  were  carried  out  that  tested 
whether  the  relationships  found  were  meaningful.  This  work  was  reported  in  a  paper 
written  for  the  SIGIR  conference. 

Important  Findings  and  Conclusions 

None. 

Significant  Hardware  Development 
None 

Special  Comments 
None. 

Implication  for  Further  Research 

We  will  continue  to  improve  the  demonstration  system  and  plan  to  carry  out  further 
classification  experiments. 

Task  3:  Image  Indexing  and  Retrieval 

Task  Objectives 

The  goal  of  this  task  is  to  develop  similarity-based  techniques  for  retrieving  images  such 
as  trademarks,  logos,  and  designs. 

Technical  Problems 

The central issue is how images can be indexed to support efficient, content-based
retrieval. The primary type of query in these environments is "find me things that look
like this". We are developing "appearance-based" retrieval of images as well as retrieval
based on more straightforward features such as color and texture. Filter-based and
frequency-domain techniques offer some potential in this area, but significant work
remains to make this approach efficient enough to handle hundreds of thousands of images.
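As an illustration of the "straightforward features" mentioned above, the following is a minimal sketch of color-based similarity using histogram intersection. It is not the project's implementation; the function names, the bin count, and the toy pixel lists are assumptions made for the example.

```python
def color_histogram(pixels, bins=4):
    """Quantize each RGB channel into `bins` levels; return a normalized histogram."""
    hist = [0.0] * (bins ** 3)
    for r, g, b in pixels:
        # Map each 0-255 channel value to a bin index in 0..bins-1.
        idx = (r * bins // 256) * bins * bins + (g * bins // 256) * bins + (b * bins // 256)
        hist[idx] += 1
    n = len(pixels)
    return [h / n for h in hist] if n else hist


def intersection(h1, h2):
    """Histogram intersection: 1.0 for identical distributions, 0.0 for disjoint ones."""
    return sum(min(a, b) for a, b in zip(h1, h2))


# Toy "images" given as lists of (R, G, B) pixels (hypothetical data).
red = [(255, 0, 0)] * 4
mostly_red = [(250, 10, 5)] * 3 + [(0, 0, 255)]
blue = [(0, 0, 255)] * 4

sim_close = intersection(color_histogram(red), color_histogram(mostly_red))
sim_far = intersection(color_histogram(red), color_histogram(blue))
```

A linear scan with a measure like this costs one histogram comparison per stored image, which is why efficiency becomes the main concern at the scale of hundreds of thousands of images.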

General  Methodology 

The evaluation of these techniques will be done in a similar way to text by developing
test collections of images. Specifically, we are working to obtain large collections of
trademark and design images, both from the PTO and from general sources such as the
web.

Technical  Results 

We have incorporated a relevance feedback technique into the trademark retrieval
system. This technique allows the searcher to indicate which trademarks are examples of
the types of images that are relevant. Based on these examples, the system updates the
original query and produces a new ranking. We have begun experiments to test the
effectiveness of this technique. We have also added the ability to input a query image
directly into the demonstration system. We continue to carry out experiments on
appearance-based techniques that combine local and global features. We have also
evaluated a technique that may be usable to separate text from design in mixed
trademarks. The technique was evaluated by segmenting words in handwritten
manuscripts.
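The feedback loop described above (mark relevant examples, update the query, re-rank) can be sketched as follows. The report does not say which update rule is used; this sketch assumes a Rocchio-style update over feature vectors with cosine ranking, and the names, weights, and three-dimensional feature vectors are hypothetical.

```python
from math import sqrt


def rocchio_update(query, relevant, alpha=1.0, beta=0.75):
    """Move the query vector toward the centroid of the relevant examples."""
    if not relevant:
        return list(query)
    dim = len(query)
    centroid = [sum(vec[i] for vec in relevant) / len(relevant) for i in range(dim)]
    return [alpha * q + beta * c for q, c in zip(query, centroid)]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank(query, images):
    """Return image ids sorted by decreasing cosine similarity to the query."""
    return sorted(images, key=lambda name: cosine(query, images[name]), reverse=True)


# Hypothetical feature vectors for three trademark images.
images = {
    "tm_a": [1.0, 0.0, 0.0],
    "tm_b": [0.0, 1.0, 0.0],
    "tm_c": [0.1, 0.9, 0.2],
}
query = [1.0, 0.2, 0.0]

first = rank(query, images)
# The searcher marks tm_c as relevant; beta=2.0 is an arbitrary choice for the demo.
new_query = rocchio_update(query, [images["tm_c"]], beta=2.0)
second = rank(new_query, images)
```

In this toy run the initial ranking puts tm_a first, and after feedback on tm_c the updated query ranks tm_c first.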

Important  Findings  and  Conclusions 

Relevance  feedback  is  a  viable  technique  for  trademark  retrieval.  Widely  varying  images 
of  text  can  be  accurately  segmented. 

Significant  Hardware  Development 

None 

Special  Comments 
None. 

Implications  for  Further  Research 

We  continue  to  focus  on  evaluating  the  accuracy  of  our  techniques  using  trademark 
testbeds  from  Britain  and,  hopefully,  from  the  U.S.  PTO.  We  will  also  continue  to 
improve  the  demonstration  system. 


Task 4: Distributed Retrieval Architecture

Task  Objectives 

The goals of this task are to scale up our current methods of automatically selecting
collections and merging results, and to investigate architectures that can support efficient
retrieval, browsing and relevance feedback in distributed environments with terabytes of
information.

Technical  Problems 


The  current  INQUERY  text  retrieval  system  uses  a  client  server  architecture  to  support 
simultaneous  retrieval  from  multiple  collections  distributed  across  one  or  more 
processors.  A  number  of  efficiency  bottlenecks  develop,  however,  when  the  size  of  the 
databases  is  very  large.  Deciding  which  subcollections  to  search  can  address  part  of  the 
problem,  but  there  are  other  problems  associated  with  the  fundamental  efficiency  of  the 
processes  involved  and  the  use  of  distributed  resources.  Image  indexing  and  retrieval 
tends  to  make  all  of  these  problems  worse  since  the  databases  and  indexes  are 
considerably  larger. 

General  Methodology 


The architectures and algorithms produced in this task will be evaluated using a
combination of standard performance (efficiency) measures and effectiveness measures.
The efficiency tests will be done using TREC data and large PTO databases, including
images, and the collection selection algorithms will be evaluated using the text
subcollections of the patents.

Technical  Results 

A new technique for collection selection in distributed retrieval was evaluated and
described in a paper. This technique is particularly appropriate for an environment like
the PTO that has very large databases and control over how those databases are
partitioned. The technique creates language or topic models through clusters and bases
the partition on those models. The retrieval results show that it is even possible to
outperform centralized retrieval using this technique. Partitioning the databases by patent
classes is similar to this technique, and we have begun discussing the implications of these
results with Dataware.
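To make the selection step concrete, the following is a minimal sketch of ranking partitions by how well their language models explain a query. It assumes the partitions have already been formed (for example by clustering or by patent class) and uses a smoothed unigram query-likelihood score; the function names, smoothing scheme, and toy documents are assumptions and not the method evaluated in the paper.

```python
from collections import Counter
from math import log


def build_model(docs):
    """Unigram term counts for one partition."""
    counts = Counter()
    for doc in docs:
        counts.update(doc.lower().split())
    return counts


def score(query, model, background, lam=0.5):
    """Log-likelihood of the query under a partition model mixed with a background model."""
    total = sum(model.values())
    bg_total = sum(background.values())
    s = 0.0
    for term in query.lower().split():
        p_part = model[term] / total if total else 0.0
        # Add-one smoothing on the background keeps the probability above zero.
        p_bg = (background[term] + 1) / (bg_total + len(background) + 1)
        s += log(lam * p_part + (1 - lam) * p_bg)
    return s


def select_partitions(query, partitions, k=1):
    """Return the names of the k partitions whose models best match the query."""
    models = {name: build_model(docs) for name, docs in partitions.items()}
    background = Counter()
    for model in models.values():
        background.update(model)
    ranked = sorted(models, key=lambda n: score(query, models[n], background), reverse=True)
    return ranked[:k]


# Hypothetical topical partitions of a small collection.
partitions = {
    "engines": ["piston engine fuel injection", "turbine engine blade cooling"],
    "textiles": ["woven fabric loom shuttle", "synthetic fiber dye process"],
}
chosen = select_partitions("engine fuel", partitions)
```

Only the selected partitions would then be searched, and their results merged into a single ranking.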


We  have  also  carried  out  experiments  and  described  a  technique  for  data  replication  for 
large  distributed  databases  that  results  in  performance  improvements.  We  are  currently 
considering  how  these  two  new  results  can  be  integrated  into  the  PTO  environment. 

Important  Findings  and  Conclusions 

Distributed  search  can  be  more  effective  than  centralized  search  if  it  is  based  on  language 
models.  Replication  can  significantly  improve  the  performance  (response  time)  of  a  large 
distributed  system. 

Significant  Hardware  Development 

None.


Special  Comments 
None 

Implications  for  Further  Research 

We  will  continue  to  evaluate  performance  of  distributed  architectures  for  scalable  IR 
using  the  new  demonstration  system. 